Movie Rating Model and Predictor
Part 4: Exploratory data analysis
The exploratory data analysis comprised both a univariate and bivariate examination of the variables.
Univariate Analysis
Bivariate Analysis
Dependent Variable
As mentioned above, the first objective was to identify an available variable that would proxy for box office success.
Table 1: Variables most highly correlated with the log box office revenue

| Variable | Correlation | Statistic | df | p-value | 95% CI |
|---|---|---|---|---|---|
| imdb_num_votes_log | 0.67 | 16.07 | 310 | p < .001 | [ 0.61 , 0.73 ] |
| votes_per_day_log | 0.55 | 11.61 | 310 | p < .001 | [ 0.47 , 0.62 ] |
| imdb_num_votes | 0.46 | 9.07 | 310 | p < .001 | [ 0.37 , 0.54 ] |
| cast_votes_log | 0.40 | 7.76 | 310 | p < .001 | [ 0.31 , 0.49 ] |
| cast_scores | 0.34 | 6.28 | 310 | p < .001 | [ 0.23 , 0.43 ] |
| cast_experience_log | 0.34 | 6.28 | 310 | p < .001 | [ 0.23 , 0.43 ] |
| cast_scores_log | 0.32 | 6.06 | 310 | p < .001 | [ 0.22 , 0.42 ] |
| cast_experience | 0.32 | 5.87 | 310 | p < .001 | [ 0.21 , 0.41 ] |
| votes_per_day | 0.31 | 5.67 | 310 | p < .001 | [ 0.2 , 0.4 ] |
| runtime | 0.28 | 5.21 | 310 | p < .001 | [ 0.18 , 0.38 ] |
Pearson product-moment correlation coefficients were computed to determine which of the available variables correlated most highly with the log of box office revenue. Table 1 reveals that the log number of IMDB votes (r = 0.674, n = 312, p < .001) had the highest correlation with log box office revenue. A scatterplot summarizes the results (Figure 1). Overall, there was a strong, positive correlation between the log number of IMDB votes and the log box office.
Figure 1: Log Box Office vs. Log IMDB Votes
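As a rough check on how the columns of Table 1 relate to one another, the test statistic, degrees of freedom, and confidence interval can be recovered from r and n alone. The sketch below assumes the interval was obtained through the Fisher z-transformation, which is the usual approach for a Pearson correlation; the function name `pearson_summary` is introduced here for illustration only.

```python
import math
from statistics import NormalDist


def pearson_summary(r: float, n: int, confidence: float = 0.95):
    """Return (t statistic, df, CI lower, CI upper) for a Pearson correlation r."""
    df = n - 2
    # t statistic for the null hypothesis that the true correlation is zero
    t_stat = r * math.sqrt(df) / math.sqrt(1.0 - r * r)
    # Fisher z-transformation: atanh(r) is approximately normal with SE 1/sqrt(n - 3)
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return t_stat, df, lower, upper


# Values reported in the text for the log number of IMDB votes
t_stat, df, lower, upper = pearson_summary(r=0.674, n=312)
print(f"t({df}) = {t_stat:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

Because r is rounded to three decimals in the text, the statistic computed this way agrees with Table 1 only to within rounding.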
Predictors
Having designated the log number of IMDB votes as the dependent variable, an analysis was conducted on the training set to determine the correlations between available predictors and log IMDB votes.
Table 2: Predictor correlations with log IMDB votes

| Variable | Correlation | Statistic | df | p-value | 95% CI |
|---|---|---|---|---|---|
| cast_scores | 0.42 | 8.90 | 372 | p < .001 | [ 0.33 , 0.5 ] |
| cast_votes_log | 0.42 | 8.88 | 372 | p < .001 | [ 0.33 , 0.5 ] |
| runtime_log | 0.39 | 8.18 | 372 | p < .001 | [ 0.3 , 0.47 ] |
| cast_scores_log | 0.39 | 8.17 | 372 | p < .001 | [ 0.3 , 0.47 ] |
| runtime | 0.38 | 7.86 | 372 | p < .001 | [ 0.29 , 0.46 ] |
| cast_votes | 0.36 | 7.43 | 372 | p < .001 | [ 0.27 , 0.44 ] |
| cast_experience_log | 0.32 | 6.44 | 372 | p < .001 | [ 0.22 , 0.41 ] |
| cast_experience | 0.31 | 6.36 | 372 | p < .001 | [ 0.22 , 0.4 ] |
| director_experience_log | 0.25 | 5.06 | 372 | p < .001 | [ 0.16 , 0.35 ] |
| director_experience | 0.23 | 4.47 | 372 | p < .001 | [ 0.13 , 0.32 ] |
| imdb_rating | 0.19 | 3.77 | 372 | p < .001 | [ 0.09 , 0.29 ] |
| audience_score | 0.16 | 3.11 | 372 | p < .01 | [ 0.06 , 0.26 ] |
| critics_score | 0.04 | 0.75 | 372 | 0.456 | [ -0.06 , 0.14 ] |
Table 2 summarizes the results of several Pearson product-moment correlation tests. Indicators of cast popularity in terms of votes and scores were the most highly correlated predictors of log IMDB votes. The relatively high correlation with runtime was unexpected.
The next section describes two linear models: (1) a multiregression model to predict log IMDB votes, and (2) a simple linear regression model to predict log box office revenue based upon the log of IMDB votes.
Part 5: Modeling
The aim at this stage was to develop two prediction models: Model One, a multiregression model that predicts the log of IMDB votes, and Model Two, a simple linear regression model that predicts log box office revenue based upon the log of IMDB votes. The former was the best performing of four multiregression models developed using forward selection and backward elimination. These four models and their model selection methods were:
Table 3: Multiregression prediction models

| Model | Model.Selection | Data |
|---|---|---|
| Alpha | Forward Selection | Full model |
| Beta | Forward Selection | Full model, influential outliers removed |
| Gamma | Backward Elimination | Full model |
| Delta | Backward Elimination | Full model, influential outliers removed |
The remainder of this section is organized as follows.
Model One: Multiregression Model
1.1. Model Selection Methods
1.2. Full Model
1.3. Model Alpha
1.4. Model Beta
1.5. Model Gamma
1.6. Model Delta
1.7. Model Comparison
1.8. Model One: Final Multiregression Model
Model Two: Simple Linear Regression Model
2.1. Model Overview
2.2. Model Diagnostics
Model Summary
Model One: Multiple Linear Regression
Model One was the best performing of models Alpha, Beta, Gamma, and Delta. The following provides an overview of the model selection methods used, then each model is described and diagnosed vis-a-vis assumptions of linearity, homoscedasticity, normality of errors, multicollinearity, and the treatment of influential points.
Model Selection Methods
Both forward selection and backward elimination model selection techniques were used. The forward selection approach optimized adjusted R-squared, whereas the backward elimination method was based upon p-values.
Forward Selection
The forward selection process began with a null model. Each candidate variable was then added to the current model, one at a time, and the variable that provided the greatest improvement over the current best adjusted R-squared was selected. The process repeated with the variables not already in the model until all variables had been analyzed. A variable was retained at a given step only if it improved adjusted R-squared.
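The forward selection procedure can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used for the analysis: the function names, the toy data, and the use of the normal equations for the least-squares fit are all introduced here. Each candidate is passed as a group of numeric columns so that a categorical variable can enter as a set of dummy columns (with its reference level dropped).

```python
from typing import Dict, List, Sequence


def solve_linear_system(matrix: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def adjusted_r2(columns: Sequence[Sequence[float]], y: Sequence[float]) -> float:
    """Adjusted R-squared of an OLS fit of y on an intercept plus the given columns."""
    n, k = len(y), len(columns)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    size = k + 1
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(size)] for i in range(size)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(size)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in design]
    y_bar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - (ss_res / (n - k - 1)) / (ss_tot / (n - 1))


def forward_select(candidates: Dict[str, List[Sequence[float]]], y: Sequence[float]) -> List[str]:
    """Greedy forward selection that maximizes adjusted R-squared."""
    selected: List[str] = []
    best = 0.0  # adjusted R-squared of the null (intercept-only) model
    remaining = list(candidates)
    while remaining:
        scores = {}
        for name in remaining:
            columns = [col for term in selected + [name] for col in candidates[term]]
            scores[name] = adjusted_r2(columns, y)
        winner = max(remaining, key=lambda name: scores[name])
        if scores[winner] <= best:
            break  # no remaining variable improves adjusted R-squared
        selected.append(winner)
        remaining.remove(winner)
        best = scores[winner]
    return selected


# Toy data (hypothetical): y depends strongly on "signal" and not on "noise"
toy_candidates = {
    "noise": [[1, -1, 1, -1, 1, -1, 1, -1]],
    "signal": [[1, 2, 3, 4, 5, 6, 7, 8]],
}
toy_y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
print(forward_select(toy_candidates, toy_y))
```

The loop stops as soon as no remaining candidate raises adjusted R-squared, which is one way to read "only the models that improved adjusted R-squared were retained".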
Backward Elimination
The backward elimination approach began with the full model. A regression analysis was performed and the least significant predictor (that with the highest p-value) was removed from the model. This process repeated, removing only the least significant predictor at each step, until all predictors had p-values below the designated \(\alpha = .05\) threshold.
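The backward elimination loop can be sketched as follows. The sketch is deliberately agnostic about how the regression is fitted: `fit_p_values` is a hypothetical callback, introduced here, that refits the model on the current predictors and returns one p-value per predictor. The stub at the bottom uses made-up p-values purely to exercise the loop.

```python
from typing import Callable, Dict, List, Tuple


def backward_eliminate(
    predictors: List[str],
    fit_p_values: Callable[[List[str]], Dict[str, float]],
    alpha: float = 0.05,
) -> Tuple[List[str], List[Tuple[str, float]]]:
    """Drop the least significant predictor until every p-value is below alpha."""
    current = list(predictors)
    removed: List[Tuple[str, float]] = []
    while current:
        p_values = fit_p_values(current)  # refit on the current predictor set
        worst = max(current, key=lambda name: p_values[name])
        if p_values[worst] < alpha:
            break  # all remaining predictors are significant
        current.remove(worst)
        removed.append((worst, p_values[worst]))
    return current, removed


def stub_fit(current: List[str]) -> Dict[str, float]:
    """Illustrative stand-in for a regression fit; the p-values are invented."""
    if "c" in current:
        return {"a": 0.01, "b": 0.20, "c": 0.60}
    if "b" in current:
        return {"a": 0.01, "b": 0.08}
    return {"a": 0.001}


kept, dropped = backward_eliminate(["a", "b", "c"], stub_fit)
print(kept, dropped)
```

Note that the p-values are recomputed after every removal, so a predictor that looks weak in the full model may become significant once a correlated predictor has been dropped.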
Full Model Selection
The motivation for this analysis was to provide insight to inform the decisions studio executives must make at the early stages of a film project. Accordingly, the variables selected for the full model were those that are “knowable” by studio executives before theatrical release. Consequently, the variables excluded from consideration as predictors consist of:
* variables not “knowable” at project inception. This would include film ratings, scores, academy awards, and box office results.
* variables that are redundant with other variables with higher correlations with the dependent variable.
* categorical variables with levels containing fewer than 5 observations, such as the actor, director and studio variables.
* DVD release dates, as DVD sales are out of the scope of this analysis.
As such, the full model is presented in Table 4.
Table 4: Variables in Full Model

| Type | Variable | Description |
|---|---|---|
| Categorical | genre | Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other) |
| Categorical | mpaa_rating | MPAA rating of the movie (G, PG, PG-13, R, Unrated) |
| Numeric | cast_scores | Total number of allocated, previously earned composite score points for the cast of a film |
| Numeric | cast_votes | Total number of allocated, previously earned IMDB votes for the cast of a film |
| Numeric | director_experience_log | Log of the total number of films directed by the film’s director |
| Numeric | runtime_log | Log runtime of movie (in log minutes) |
The following sections explore various models, model selection techniques, and model diagnostics. Comparisons were conducted and the models were evaluated on test data for prediction accuracy and stability. Lastly, the best performing model is selected and described in detail.
Model Alpha
For this model, a forward selection procedure was undertaken based upon the full model described above. Table 5 lists the variables in the order in which they were added.
Table 5: Model Alpha forward selection process

| Step | Selected | Model.Size | DF | F.statistic | R.Squared | Adjusted.R2 | p.value | Pct Chg |
|---|---|---|---|---|---|---|---|---|
| 1 | genre | 1 | 11 363 | 11.19 | 0.24 | 0.22 | 0 | 0.00 |
| 2 | cast_scores | 2 | 12 362 | 18.87 | 0.36 | 0.34 | 0 | 60.47 |
| 3 | director_experience_log | 3 | 13 361 | 18.59 | 0.38 | 0.36 | 0 | 4.64 |
| 4 | mpaa_rating | 4 | 17 357 | 14.63 | 0.40 | 0.37 | 0 | 2.22 |
As indicated in Table 6 and graphically depicted in Figure 2, the model was significant (F(17, 357) = 14.633, p < .001), with an adjusted R-squared of 0.369.
Table 6: Model Alpha Summary Statistics

| Model | Size | df | df Residuals | F Statistic | RMSE | Residual SE | R-Squared | Adj R-Squared | p-value | % Variance |
|---|---|---|---|---|---|---|---|---|---|---|
| Model Alpha | 4 | 17 | 357 | 14.633 | 1.912 | 1.912 | 0.396 | 0.369 | 0 | 39.607 |
Figure 2 Model Alpha Regression
Model Diagnostics
Linearity
The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 3.
Figure 3 Model Alpha linearity plots
A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(17), p < .001). As such, the linearity assumption was met in this case.
Homoscedasticity
The following plot (Figure 4) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.
Figure 4 Model Alpha homoscedasticity plot
The residuals plot above indicated equal dispersion of residuals about the mean. A Breusch-Pagan test was conducted to test the homoscedasticity assumption. The results were not significant (F(1), p = 0.371). As such, the homoscedasticity assumption was met in this case.
Residuals
The histogram and the normal Q-Q plot in Figure 5 illustrate the distribution of residuals.
Figure 5 Model Alpha residuals plot
The histogram and normal Q-Q plot suggested a nearly normal distribution of residuals. A review of the Shapiro-Wilk test (alpha = 0.001, SW = 0.994, p = 0.118) and the skewness (-0.019) and kurtosis (2.561) supported the assumption of normality.
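For readers who want to see how skewness and kurtosis figures of this kind are obtained, the sketch below computes the moment-based versions, under which a normal distribution has skewness 0 and kurtosis 3. Whether the analysis used exactly these estimators (rather than bias-corrected or excess-kurtosis variants) is an assumption; the function name and the sample values are illustrative.

```python
def skewness_and_kurtosis(values):
    """Moment-based skewness (m3 / m2^1.5) and kurtosis (m4 / m2^2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Hypothetical residuals: symmetric, so skewness is zero
skew, kurt = skewness_and_kurtosis([1.0, 2.0, 3.0, 4.0, 5.0])
print(skew, kurt)
```

On this scale, the reported kurtosis values between 2.4 and 2.7 indicate tails slightly lighter than those of a normal distribution.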
Multicollinearity
As shown in Figure 6 and Table 7, collinearity did not appear extant for this model. Variance inflation factors were computed for each predictor in the model. The maximum VIF of 2.7 did not exceed the threshold of 4. As such, the absence of multicollinearity was assumed for this model.
Figure 6: Model Alpha correlations among quantitative predictors
Table 7: Model Alpha variance inflation factors

| Term | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| genre | 2.685 | 10 | 1.051 |
| cast_scores | 1.286 | 1 | 1.134 |
| director_experience_log | 1.118 | 1 | 1.058 |
| mpaa_rating | 2.455 | 4 | 1.119 |
Upon further analysis, the high VIF for genre and MPAA rating was a consequence of the reference categories having a small proportion of the overall cases. The reference category for genre, Action & Adventure, had just 39 films, 10.4% of the cases. The reference category for MPAA rating was G, consisting of 5 observations, 1.3% of the cases. Though the p-values for the indicator variables may be high, the overall test that all indicators have coefficients of zero is unaffected by the high VIFs.
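The last column of the VIF table is a degrees-of-freedom adjustment of the generalized VIF, which makes terms with different numbers of coefficients comparable. A short sketch of the arithmetic, using the GVIF and Df values reported for Model Alpha (the function name is introduced here for illustration):

```python
def adjusted_gvif(gvif: float, df: int) -> float:
    """Return GVIF^(1 / (2 * Df)), the df-adjusted generalized VIF."""
    return gvif ** (1.0 / (2.0 * df))


# (GVIF, Df) pairs as reported for Model Alpha
model_alpha_gvif = {
    "genre": (2.685, 10),
    "cast_scores": (1.286, 1),
    "director_experience_log": (1.118, 1),
    "mpaa_rating": (2.455, 4),
}

for term, (gvif, df) in model_alpha_gvif.items():
    print(f"{term}: {adjusted_gvif(gvif, df):.3f}")
```

For a term with a single degree of freedom, the adjusted value is simply the square root of the ordinary VIF. Small differences in the third decimal relative to the table can arise because the GVIF inputs here are already rounded.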
Outliers
Figure 7 Model Alpha Outliers
Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 21 cases exerting undue influence on the model. To discern the effect of these outliers on the model, a new model (Model Beta) was created with the outliers removed.
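Cook's distance combines the size of a residual with the leverage of the observation. The sketch below shows the standard formula given residuals and leverages (hat values) from a fitted model. The inputs are hypothetical, and the cutoff used to flag the influential cases is not stated in this report, so none is applied here; 4/n is one commonly used rule of thumb.

```python
from typing import List, Sequence


def cooks_distances(
    residuals: Sequence[float], leverages: Sequence[float], n_params: int
) -> List[float]:
    """Cook's distance: D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2."""
    n = len(residuals)
    # Residual mean square with n - p degrees of freedom
    mse = sum(e * e for e in residuals) / (n - n_params)
    return [
        (e * e / (n_params * mse)) * (h / (1.0 - h) ** 2)
        for e, h in zip(residuals, leverages)
    ]


# Hypothetical residuals and leverages for a model with 2 coefficients
distances = cooks_distances(
    residuals=[1.0, -1.0, 2.0, -2.0],
    leverages=[0.5, 0.5, 0.5, 0.5],
    n_params=2,
)
print(distances)
```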
Model Beta
This was also a forward selection model; however, it was based upon the full model with outliers from Model Alpha removed. The variables were added as described in Table 8.
Table 8: Model Beta forward selection process

| Step | Selected | Model.Size | DF | F.statistic | R.Squared | Adjusted.R2 | p.value | Pct Chg |
|---|---|---|---|---|---|---|---|---|
| 1 | genre | 1 | 11 342 | 11.53 | 0.25 | 0.23 | 0 | 0.00 |
| 2 | cast_scores | 2 | 12 341 | 21.13 | 0.40 | 0.39 | 0 | 67.83 |
| 3 | director_experience_log | 3 | 13 340 | 21.12 | 0.43 | 0.41 | 0 | 5.44 |
| 4 | mpaa_rating | 4 | 17 336 | 16.14 | 0.44 | 0.41 | 0 | 0.25 |
| 5 | thtr_rel_month | 5 | 28 325 | 10.01 | 0.45 | 0.41 | 0 | 0.25 |
As indicated in Table 9 and graphically depicted in Figure 8, the model was significant (F(28, 325) = 10.012, p < .001), with an adjusted R-squared of 0.409.
Table 9: Model Beta Summary Statistics

| Model | Size | df | df Residuals | F Statistic | RMSE | Residual SE | R-Squared | Adj R-Squared | p-value | % Variance |
|---|---|---|---|---|---|---|---|---|---|---|
| Model Beta | 5 | 28 | 325 | 10.012 | 1.83 | 1.83 | 0.454 | 0.409 | 0 | 45.408 |
Figure 8 Model Beta Regression
Model Diagnostics
Linearity
The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 9.
Figure 9 Model Beta linearity plots
A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(28), p < .001). As such, the linearity assumption was met in this case.
Homoscedasticity
The following plot (Figure 10) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.
Figure 10 Model Beta homoscedasticity plot
The residuals plot above indicated equal dispersion of residuals about the mean. A Breusch-Pagan test was conducted to test the homoscedasticity assumption. The results were not significant (F(1), p = 0.276). As such, the homoscedasticity assumption was met in this case.
Residuals
The histogram and the normal Q-Q plot in Figure 11 illustrate the distribution of residuals.
Figure 11 Model Beta residuals plot
The histogram and normal Q-Q plot suggested a nearly normal distribution of residuals. A review of the Shapiro-Wilk test (alpha = 0.001, SW = 0.994, p = 0.195) and the skewness (0.043) and kurtosis (2.641) supported the assumption of normality.
Multicollinearity
As shown in Figure 12 and Table 10, collinearity appeared extant for this model. Variance inflation factors were computed for each predictor in the model. The maximum VIF of 4.1 exceeded the threshold of 4. As such, the correlation among the predictors would require further consideration.
Figure 12: Correlations among quantitative predictors
Table 10: Model Beta variance inflation factors

| Term | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| genre | 4.128 | 10 | 1.073 |
| cast_scores | 1.360 | 1 | 1.166 |
| director_experience_log | 1.135 | 1 | 1.065 |
| mpaa_rating | 3.297 | 4 | 1.161 |
| thtr_rel_month | 1.640 | 11 | 1.023 |
Upon further analysis, the high VIF for genre and MPAA rating was a consequence of the reference categories having a small proportion of the overall cases. The reference category for genre, Action & Adventure, had just 39 films, 10.4% of the cases. The reference category for MPAA rating was G, consisting of 5 observations, 1.3% of the cases. Though the p-values for the indicator variables may be high, the overall test that all indicators have coefficients of zero is unaffected by the high VIFs.
Outliers
Figure 13 Model Beta Outliers
Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 15 cases exerting undue influence on the model. A case-wise review of the influential points did not reveal any data quality issues; therefore, the influential points would not be removed from the model.
Model Gamma
For this model, a backward elimination procedure was undertaken based upon the full model. The variables were removed as described in Table 11.
Table 11: Model Gamma

| Steps | Removed | p.value |
|---|---|---|
| 1 | cast_votes_log | 0.70 |
| 2 | thtr_rel_month | 0.09 |
The model therefore retained the following variables:
Table 12 Model Gamma Variables

| Variable | Description |
|---|---|
| genre | Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other) |
| mpaa_rating | MPAA rating of the movie (G, PG, PG-13, R, Unrated) |
| director_experience_log | Log of the total number of films directed by the film’s director |
| cast_scores | Total number of allocated, previously earned composite score points for the cast of a film |
As indicated in Table 13 and graphically depicted in Figure 14, the model was significant (F(17, 357) = 14.633, p < .001), with an adjusted R-squared of 0.369.
Table 13 Model Gamma Summary Statistics

| Model | Size | df | df Residuals | F Statistic | RMSE | Residual SE | R-Squared | Adj R-Squared | p-value | % Variance |
|---|---|---|---|---|---|---|---|---|---|---|
| Model Gamma | 4 | 17 | 357 | 14.633 | 1.912 | 1.912 | 0.396 | 0.369 | 0 | 39.607 |
Figure 14 Model Gamma Regression
Model Diagnostics
Linearity
The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 15.
Figure 15 Model Gamma linearity plots
A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(17), p < .001). As such, the linearity assumption was met in this case.
Homoscedasticity
The following plot (Figure 16) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.
Figure 16 Model Gamma homoscedasticity plot
The residuals plot above indicated equal dispersion of residuals about the mean. A Breusch-Pagan test was conducted to test the homoscedasticity assumption. The results were not significant (F(1), p = 0.371). As such, the homoscedasticity assumption was met in this case.
Residuals
The histogram and the normal Q-Q plot in Figure 17 illustrate the distribution of residuals.
Figure 17 Model Gamma residuals plot
The histogram and normal Q-Q plot suggested a nearly normal distribution of residuals. A review of the Shapiro-Wilk test (alpha = 0.001, SW = 0.994, p = 0.118) and the skewness (-0.019) and kurtosis (2.561) supported the assumption of normality.
Multicollinearity
As shown in Figure 18 and Table 14, collinearity did not appear extant for this model. Variance inflation factors were computed for each predictor in the model. The maximum VIF of 2.7 did not exceed the threshold of 4. As such, the absence of multicollinearity was assumed for this model.
Figure 18: Correlations among quantitative predictors
Table 14: Model Gamma variance inflation factors

| Term | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| genre | 2.685 | 10 | 1.051 |
| mpaa_rating | 2.455 | 4 | 1.119 |
| cast_scores | 1.286 | 1 | 1.134 |
| director_experience_log | 1.118 | 1 | 1.058 |
Upon further analysis, the high VIF for genre and MPAA rating was a consequence of the reference categories having a small proportion of the overall cases. The reference category for genre, Action & Adventure, had just 39 films, 10.4% of the cases. The reference category for MPAA rating was G, consisting of 5 observations, 1.3% of the cases. Though the p-values for the indicator variables may be high, the overall test that all indicators have coefficients of zero is unaffected by the high VIFs.
Outliers
Figure 19 Model Gamma Outliers
Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 21 cases exerting undue influence on the model. To discern the effect of the influential points on the model, a new model (Model Delta) was created without the influential points of this model.
Model Delta
This was also a backward elimination model; however, it was based upon the full model with outliers from Model Gamma removed. The variables were removed as described in Table 15.
Table 15: Model Delta

| Steps | Removed | p.value |
|---|---|---|
| 1 | cast_votes_log | 0.52 |
| 2 | thtr_rel_month | 0.10 |
The model therefore retained the following variables:
Table 16 Model Delta Variables

| Variable | Description |
|---|---|
| genre | Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other) |
| mpaa_rating | MPAA rating of the movie (G, PG, PG-13, R, Unrated) |
| director_experience_log | Log of the total number of films directed by the film’s director |
| cast_scores | Total number of allocated, previously earned composite score points for the cast of a film |
As indicated in Table 17 and graphically depicted in Figure 20, the model was significant (F(17, 336) = 16.144, p < .001), with an adjusted R-squared of 0.408.
Table 17 Model Delta Summary Statistics

| Model | Size | df | df Residuals | F Statistic | RMSE | Residual SE | R-Squared | Adj R-Squared | p-value | % Variance |
|---|---|---|---|---|---|---|---|---|---|---|
| Model Delta | 4 | 17 | 336 | 16.144 | 1.832 | 1.832 | 0.435 | 0.408 | 0 | 43.463 |
Figure 20 Model Delta Regression
Model Diagnostics
Linearity
The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 21.
Figure 21 Model Delta linearity plots
A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(17), p < .001). As such, the linearity assumption was met in this case.
Homoscedasticity
The following plot (Figure 22) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.
Figure 22 Model Delta homoscedasticity plot
The residuals plot above indicated equal dispersion of residuals about the mean. A Breusch-Pagan test was conducted to test the homoscedasticity assumption. The results were not significant (F(1), p = 0.373). As such, the homoscedasticity assumption was met in this case.
Residuals
The histogram and the normal Q-Q plot in Figure 23 illustrate the distribution of residuals.
Figure 23 Model Delta residuals plot
The histogram and normal Q-Q plot suggested a nearly normal distribution of residuals. A review of the Shapiro-Wilk test (alpha = 0.001, SW = 0.992, p = 0.064) and the skewness (0.039) and kurtosis (2.642) supported the assumption of normality.
Multicollinearity
As shown in Figure 24 and Table 18, collinearity did not appear extant for this model. Variance inflation factors were computed for each predictor in the model. The maximum VIF of 3.1 did not exceed the threshold of 4. As such, the absence of multicollinearity was assumed for this model.
Figure 24: Correlations among quantitative predictors
Table 18: Model Delta variance inflation factors

| Term | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| genre | 3.087 | 10 | 1.058 |
| mpaa_rating | 2.855 | 4 | 1.140 |
| cast_scores | 1.288 | 1 | 1.135 |
| director_experience_log | 1.105 | 1 | 1.051 |
Upon further analysis, the high VIF for genre and MPAA rating was a consequence of the reference categories having a small proportion of the overall cases. The reference category for genre, Action & Adventure, had just 39 films, 10.4% of the cases. The reference category for MPAA rating was G, consisting of 5 observations, 1.3% of the cases. Though the p-values for the indicator variables may be high, the overall test that all indicators have coefficients of zero is unaffected by the high VIFs.
Outliers
Figure 25 Model Delta Outliers
Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 15 cases exerting undue influence on the model. A case-wise review of the influential points did not reveal any data quality issues; therefore, the influential points would not be removed from the model.
Model Comparisons
To summarize, models Alpha and Beta were constructed using forward selection and models Gamma and Delta were developed via backward elimination. Models Beta and Delta were fitted without the influential data points from models Alpha and Gamma respectively.
Table 19 Summary of models

| Model | Size | df | df Residuals | F Statistic | RMSE | Residual SE | R-Squared | Adj R-Squared | p-value | % Variance |
|---|---|---|---|---|---|---|---|---|---|---|
| Model Alpha | 4 | 17 | 357 | 14.633 | 1.912 | 1.912 | 0.396 | 0.369 | 0 | 39.607 |
| Model Beta | 5 | 28 | 325 | 10.012 | 1.830 | 1.830 | 0.454 | 0.409 | 0 | 45.408 |
| Model Gamma | 4 | 17 | 357 | 14.633 | 1.912 | 1.912 | 0.396 | 0.369 | 0 | 39.607 |
| Model Delta | 4 | 17 | 336 | 16.144 | 1.832 | 1.832 | 0.435 | 0.408 | 0 | 43.463 |
Forward Selection vs. Backward Elimination
The differences in root mean square error between the forward selection and backward elimination models were not significant (0% and -0.09%). Similarly, the differences in adjusted R-squared (0% and -0.25%) were not significant. Lastly, the differences in the percent of variance explained by the models were also not significant (0% and -4.28%).
Influential Points: Drop or Not
The Beta and Delta models were trained on data sans the influential points from Alpha and Gamma. The differences in RMSE (4.49% and 4.4%) were insignificant, as were the differences in adjusted R-squared (10.77% and 10.49%), and the percent of variance explained (14.65% and 9.74%). However, a case-wise review of the influential points did not reveal any data quality issues; therefore, the points would not be removed.
Prediction Accuracy
To evaluate the effects of model selection method and the treatment of outliers on prediction accuracy, the four multiregression models were evaluated for prediction accuracy on the test data. Four measures of prediction accuracy were used:
- MAPE - Mean Absolute Percentage Error
- MPE - Mean Percentage Error
- MSE - Mean Squared Error
- RMSE - Root Mean Squared Error
In addition, a percent accuracy measure was computed as the percentage of the observations in the test set in which the actual log number of IMDB votes fell within the prediction interval.
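The four accuracy measures and the interval-based percent accuracy can be sketched as follows. The sign convention for the percentage error, (actual - predicted) / actual, is an assumption made here; under it, a negative MPE corresponds to predictions that exceed the actual values on average. The function names and the small example vectors are illustrative.

```python
import math
from typing import Dict, Sequence


def prediction_metrics(actual: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Return MAPE, MPE, MSE, and RMSE for paired actual and predicted values."""
    n = len(actual)
    pct_errors = [(a - p) / a * 100.0 for a, p in zip(actual, predicted)]
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    return {
        "MAPE": sum(abs(e) for e in pct_errors) / n,
        "MPE": sum(pct_errors) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
    }


def interval_coverage(
    actual: Sequence[float], lower: Sequence[float], upper: Sequence[float]
) -> float:
    """Percentage of actual values that fall inside their prediction interval."""
    hits = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return 100.0 * hits / len(actual)


# Hypothetical log IMDB votes for two films
metrics = prediction_metrics(actual=[10.0, 12.0], predicted=[11.0, 12.0])
coverage = interval_coverage(actual=[10.0, 12.0], lower=[9.0, 13.0], upper=[11.0, 14.0])
print(metrics, coverage)
```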
Table 20 Model Predictive Accuracy Summary

| Model | Size | F Statistic | R-Squared | Adj R-Squared | % Variance | MAPE | MPE | MSE | RMSE | % Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| Model Alpha | 4 | 14.633 | 0.396 | 0.369 | 39.607 | 12.215 | -0.688 | 4.040 | 2.010 | 97.872 |
| Model Beta | 5 | 10.012 | 0.454 | 0.409 | 45.408 | 11.842 | -0.989 | 3.862 | 1.965 | 96.809 |
| Model Gamma | 4 | 14.633 | 0.396 | 0.369 | 39.607 | 12.215 | -0.688 | 4.040 | 2.010 | 97.872 |
| Model Delta | 4 | 16.144 | 0.435 | 0.408 | 43.463 | 12.001 | -0.768 | 3.916 | 1.979 | 98.936 |
There were no significant differences in MAPE, MSE, and RMSE between the models. The negative MPE indicated that all models were biased toward overprediction. The models with the influential points had slightly greater prediction accuracy, though the models without the outliers had higher adjusted coefficients of determination. Given the lack of justification for excluding the influential points, models Beta and Delta were ruled out. Though models Alpha and Gamma performed identically, Alpha was selected as the most parsimonious model since two predictors accounted for 36.4% of total variation and NA% of variation allocated to terms.
Model One: Final Multiregression Model
The final prediction equation was defined as follows:
\(y_i = 12.79953 - 0.299x_1 - 2.135x_2 - 0.997x_3 - 3.52x_4 - 1.248x_5 - 0.269x_6 - 1.464x_7 - 0.682x_8 + 0.573x_9 + 1.464x_{10} + 1.101x_{11} + 1.391x_{12} + 0.968x_{13} + 0.052x_{14} + 0.003x_{15} + 0.656x_{16} + \epsilon\)
where:
\(x_1\) is genreAnimation
\(x_2\) is genreArt House & International
\(x_3\) is genreComedy
\(x_4\) is genreDocumentary
\(x_5\) is genreDrama
\(x_6\) is genreHorror
\(x_7\) is genreMusical & Performing Arts
\(x_8\) is genreMystery & Suspense
\(x_9\) is genreOther
\(x_{10}\) is genreScience Fiction & Fantasy
\(x_{11}\) is mpaa_ratingPG
\(x_{12}\) is mpaa_ratingPG-13
\(x_{13}\) is mpaa_ratingR
\(x_{14}\) is mpaa_ratingUnrated
\(x_{15}\) is cast_scores
\(x_{16}\) is director_experience_log
The genre and MPAA rating variables were coded 0 or 1 in accordance with the genre and MPAA rating of each observation.
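To make the indicator coding concrete, the prediction equation can be evaluated as in the sketch below. The coefficients are those listed in the equation; the reference levels (Action & Adventure for genre and G for MPAA rating) contribute zero. The function name and the example film are hypothetical, and the error term is omitted because this is a point prediction.

```python
INTERCEPT = 12.79953

# Indicator coefficients; the reference level of each factor contributes 0
GENRE = {
    "Action & Adventure": 0.0,
    "Animation": -0.299,
    "Art House & International": -2.135,
    "Comedy": -0.997,
    "Documentary": -3.52,
    "Drama": -1.248,
    "Horror": -0.269,
    "Musical & Performing Arts": -1.464,
    "Mystery & Suspense": -0.682,
    "Other": 0.573,
    "Science Fiction & Fantasy": 1.464,
}
MPAA_RATING = {"G": 0.0, "PG": 1.101, "PG-13": 1.391, "R": 0.968, "Unrated": 0.052}
CAST_SCORES = 0.003
DIRECTOR_EXPERIENCE_LOG = 0.656


def predict_log_imdb_votes(
    genre: str, mpaa_rating: str, cast_scores: float, director_experience_log: float
) -> float:
    """Point prediction of log IMDB votes from the final multiregression model."""
    return (
        INTERCEPT
        + GENRE[genre]
        + MPAA_RATING[mpaa_rating]
        + CAST_SCORES * cast_scores
        + DIRECTOR_EXPERIENCE_LOG * director_experience_log
    )


# Hypothetical film: a PG-13 drama with 500 cast score points and log director experience of 1
print(predict_log_imdb_votes("Drama", "PG-13", 500.0, 1.0))
```

Looking up a genre or rating in a dictionary is equivalent to setting the matching indicator to 1 and all others to 0.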
Analysis of Variance
Figure 26 summarizes the analysis of variance.

| Term | Df | Sum Sq | Mean Sq | F Statistic | Pr(>F) | % Var |
|---|---|---|---|---|---|---|
| genre | 10 | 509.207 | 50.921 | 13.925 | 0.000 | 23.56 |
| mpaa_rating | 4 | 71.458 | 17.864 | 4.885 | 0.001 | 3.31 |
| cast_scores | 1 | 236.670 | 236.670 | 64.720 | 0.000 | 10.95 |
| director_experience_log | 1 | 38.831 | 38.831 | 10.619 | 0.001 | 1.80 |
| Residuals | 357 | 1305.485 | 3.657 | NA | NA | 60.39 |
Figure 26 Model Alpha analysis of variance
An analysis of variance was conducted on the influence of the 4 independent variables on the log IMDB votes. The effect of genre on the log IMDB votes produced an F statistic of F(10, 357) = 13.925, p < .001, representing 23.56% of the variance. The effect of mpaa_rating yielded an F statistic of F(4, 357) = 4.885, p < .001, accounting for 3.31% of the variance. The effect of cast_scores yielded an F statistic of F(1, 357) = 64.72, p < .001, representing 10.95% of the variance. The effect of director_experience_log yielded an F statistic of F(1, 357) = 10.619, p < .01, representing 1.8% of the variance. Finally, the residuals accounted for approximately 60.39% of the variance. The model was significant (F(17, 357) = 14.633, p < .001), with an adjusted R-squared of 0.369.
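The F statistics and variance percentages quoted above follow directly from the degrees of freedom and sums of squares in the analysis of variance table. A short sketch of that arithmetic, using the tabulated values (the variable and function names are introduced here for illustration):

```python
# (Df, Sum Sq) for each term, as reported in the analysis of variance table
anova_terms = {
    "genre": (10, 509.207),
    "mpaa_rating": (4, 71.458),
    "cast_scores": (1, 236.670),
    "director_experience_log": (1, 38.831),
}
residual_df, residual_ss = 357, 1305.485

total_ss = sum(ss for _, ss in anova_terms.values()) + residual_ss
residual_ms = residual_ss / residual_df  # mean square error


def term_summary(name: str):
    """Return (F statistic, percent of total variance) for one model term."""
    df, ss = anova_terms[name]
    f_stat = (ss / df) / residual_ms
    pct_var = 100.0 * ss / total_ss
    return f_stat, pct_var


for term in anova_terms:
    f_stat, pct_var = term_summary(term)
    print(f"{term}: F = {f_stat:.3f}, % Var = {pct_var:.2f}")

# R-squared is the share of variance not left in the residuals
r_squared = 1.0 - residual_ss / total_ss
print(f"R-squared = {r_squared:.3f}")
```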
Interpretation of Coefficients
Although there were only 4 variables, there were some 17 coefficients, a consequence of the number of levels in the categorical variables. The coefficient estimates are identified in Table 21.
Table 21: Model Alpha Coefficients

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 12.800 | 1.041 | 12.293 | 0.000 |
| genreAnimation | -0.299 | 1.137 | -0.263 | 0.792 |
| genreArt House & International | -2.135 | 0.766 | -2.786 | 0.006 |
| genreComedy | -0.997 | 0.404 | -2.466 | 0.014 |
| genreDocumentary | -3.520 | 0.529 | -6.651 | 0.000 |
| genreDrama | -1.248 | 0.352 | -3.542 | 0.000 |
| genreHorror | -0.269 | 0.695 | -0.387 | 0.699 |
| genreMusical & Performing Arts | -1.464 | 0.793 | -1.846 | 0.066 |
| genreMystery & Suspense | -0.682 | 0.454 | -1.501 | 0.134 |
| genreOther | 0.573 | 0.940 | 0.609 | 0.543 |
| genreScience Fiction & Fantasy | 1.464 | 0.930 | 1.574 | 0.116 |
| mpaa_ratingPG | 1.101 | 1.008 | 1.092 | 0.276 |
| mpaa_ratingPG-13 | 1.391 | 1.017 | 1.367 | 0.172 |
| mpaa_ratingR | 0.968 | 1.003 | 0.965 | 0.335 |
| mpaa_ratingUnrated | 0.052 | 1.061 | 0.049 | 0.961 |
| cast_scores | 0.003 | 0.000 | 7.127 | 0.000 |
| director_experience_log | 0.656 | 0.201 | 3.259 | 0.001 |
The intercept estimate, 12.8, is the regression estimate for the mean log number of IMDB votes for a G-rated action and adventure film, with zeros for all of the other variables. The other coefficient estimates adjust the estimate accordingly. Therefore, a prediction for the log number of IMDB votes was equal to:
* the intercept value, 12.8,
* plus 0.003 log IMDB votes for each point score for the cast members,
* plus a number of log IMDB votes associated with the genre of the film,
* plus 0.656 log IMDB votes for each log unit of films directed by the film’s director,
* plus a number of log IMDB votes associated with the MPAA rating for the film.
It should be noted that the influence of cast votes and the log was not significant.
Model Two: Simple Linear Regression
Model Overview
A simple linear regression was calculated to predict the log of box office revenue based upon the log of the number of IMDB votes. A significant regression equation was found (F(1, 222) = 356.611, p < .001), with an \(R^2\) of 0.616. The prediction equation is as follows:
\(y_i = 9.13 + 0.97x_1 + \epsilon\)
where:
\(x_1\) is imdb_num_votes_log
| Term | Df | Sum Sq | Mean Sq | F Statistic | Pr(>F) | % Var |
|---|---|---|---|---|---|---|
| imdb_num_votes_log | 1 | 667.10 | 667.10 | 356.61 | 0 | 61.63 |
| Residuals | 222 | 415.29 | 1.87 | NA | NA | 38.37 |
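As a quick consistency check, the F statistic and \(R^2\) can be recomputed from the sums of squares and degrees of freedom in the table above. This is a minimal sketch using only the rounded table values; the variable names are illustrative.

```python
# Values copied from the Model Two ANOVA table above.
ss_model = 667.10   # Sum Sq for imdb_num_votes_log
ss_resid = 415.29   # Sum Sq for residuals
df_model = 1
df_resid = 222

# Mean squares are sums of squares divided by their degrees of freedom.
ms_model = ss_model / df_model
ms_resid = ss_resid / df_resid

# F is the ratio of the mean squares; R^2 is the explained share of total variation.
f_stat = ms_model / ms_resid
r_squared = ss_model / (ss_model + ss_resid)

print(f"Mean Sq (residuals): {ms_resid:.2f}")
print(f"F({df_model}, {df_resid}) = {f_stat:.2f}")
print(f"R^2 = {r_squared:.3f}")
```

The recomputed values agree with the table to rounding, with 1 and 222 degrees of freedom for the F statistic.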
Model Diagnostics
Linearity
The linearity of the predictor with the log of box office is illustrated in Figure 27.
Figure 27 Model Two linearity plot
A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(2), p < .001). As such, the linearity assumption was met in this case.
Homoscedasticity
The following plot (Figure 28) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.
Figure 28 Model Two homoscedasticity plot
The residuals plot above indicated equal dispersion of residuals about the mean. A Breusch-Pagan test was conducted to test the homoscedasticity assumption. The results were not significant (F(1), p = 0.052). As such, the homoscedasticity assumption was met in this case.
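For readers unfamiliar with the test, the following is a minimal sketch of the studentized (Koenker) form of the Breusch-Pagan test for a single predictor. It is not the procedure used to produce the result above, which reports an F-type statistic; here the statistic is n times the \(R^2\) of an auxiliary regression of squared residuals on the predictor, compared against a chi-square distribution with 1 degree of freedom. The data are synthetic and deterministic, and all function names are illustrative.

```python
import math


def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(x, y):
    """Coefficient of determination for the simple regression of y on x."""
    a, b = ols_simple(x, y)
    y_bar = sum(y) / len(y)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / ss_tot


def breusch_pagan(x, y):
    """Return (LM statistic, p-value) for the studentized Breusch-Pagan test."""
    a, b = ols_simple(x, y)
    sq_resid = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    # Auxiliary regression: squared residuals on the predictor.
    lm = max(len(x) * r_squared(x, sq_resid), 0.0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value


# Synthetic data: the error spread is constant in one case and grows with x in the other.
x = [float(i) for i in range(1, 41)]
sign = [1.0 if i % 2 == 0 else -1.0 for i in range(1, 41)]
y_constant = [2.0 + 0.5 * xi + s for xi, s in zip(x, sign)]
y_growing = [2.0 + 0.5 * xi + 0.5 * xi * s for xi, s in zip(x, sign)]

lm_const, p_const = breusch_pagan(x, y_constant)
lm_grow, p_grow = breusch_pagan(x, y_growing)
print(f"constant spread: LM = {lm_const:.3f}, p = {p_const:.3f}")
print(f"growing spread:  LM = {lm_grow:.3f}, p = {p_grow:.3g}")
```

A non-significant result, as reported for Model Two, means the squared residuals show no detectable linear relationship with the predictor.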
Residuals
The histogram and the normal Q-Q plot in Figure 29 illustrate the distribution of residuals.
Figure 29 Model Two residuals plot
The histogram and normal Q-Q plot suggested a nearly normal distribution of residuals. A review of the Shapiro-Wilk test (alpha = 0.001, SW = 0.983, p = 0.009) and the skewness (-0.279) and kurtosis (2.386) supported the assumption of normality.
Outliers
Figure 30 Model Two Outliers
Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 8 cases exerting undue influence on the model. A case-wise review of the influential points did not reveal any data quality issues; therefore, the influential points were not removed from the model.
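The sketch below illustrates how Cook's distance flags influential cases in a simple linear regression. The cutoff of 4/n is a common rule of thumb and is an assumption here; the text does not state which threshold was applied. The data are synthetic and the function name is illustrative.

```python
def cooks_distance(x, y):
    """Cook's distance for each case in a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    p = 2  # estimated parameters: intercept and slope
    mse = sum(r ** 2 for r in resid) / (n - p)
    # Leverage of each case in a one-predictor model.
    leverage = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        (r ** 2 / (p * mse)) * h / (1.0 - h) ** 2
        for r, h in zip(resid, leverage)
    ]


# Synthetic data: points on the line y = 2x, except the last, which is shifted upward.
x = [float(i) for i in range(1, 11)]
y = [2.0 * xi for xi in x]
y[-1] = 40.0

distances = cooks_distance(x, y)
threshold = 4.0 / len(x)  # assumed rule-of-thumb cutoff
flagged = [i for i, d in enumerate(distances) if d > threshold]
print("Flagged case indices:", flagged)
```

Flagged cases are candidates for review rather than automatic removal, which matches the case-wise review described above.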
Model Summary
The purpose of this section was to develop models that would predict “box office success”. Therefore, two regression models were fit in this section: Model One (F(17, 357) = 14.63, p < .001) was selected from among four multiple linear regression models fit with forward selection and backward elimination algorithms. Model Two, a simple linear regression model (F(1, 222) = 356.61, p < .001), predicts log box office revenue based upon the log number of IMDB votes.
Next, the models will be used to predict the log number of IMDB votes and the log box office for a randomly selected film.
Part 6: Prediction
The Model Alpha regression model (F(17, 357) = 14.63, p < .001) was used to predict the imdb_num_votes_log of the film ‘Pootie Tang’. The actual value was 13.45 log IMDB votes. The model predicted 14.47 log IMDB votes, with a 95% prediction interval of [11.28, 17.67] log IMDB votes, equating to an overprediction of 7.63%.
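The overprediction figure is a relative error on the log scale. The sketch below recomputes it from the rounded values quoted above; it gives roughly 7.6%, and the small difference from the reported 7.63% is presumably due to rounding of the inputs. The variable names are illustrative.

```python
# Rounded values quoted in the text (log IMDB votes).
actual = 13.45
predicted = 14.47
lower, upper = 11.28, 17.67  # 95% prediction interval

# Relative error of the prediction, expressed as a percentage of the actual value.
pct_error = (predicted - actual) / actual * 100

# Check whether the actual value falls inside the prediction interval.
inside_interval = lower <= actual <= upper

print(f"Overprediction: {pct_error:.2f}%")
print(f"Actual value inside the 95% prediction interval: {inside_interval}")
```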
Part 7: Conclusion
The purpose of this analysis was to determine the factors that make a movie popular, in terms of box office success. More to the point, the aim was to generate insight that would inform the decisions studio executives must make at project inception or in the early stages of a film project. Accordingly, the features of interest were those knowable by studio executives at or near project outset. As such, variables such as critics' scores and ratings, audience scores and ratings, IMDB rating, Academy Awards, and box office success were considered only as possible dependent variables, not as independent variables. Instead, new features were created to characterize actor and director experience and popularity from a historical point of view. Given the available data, three key observations relating to indicators and predictors of movie success emerged as noteworthy.
Votes
Revenue data was obtained for a random sample of films from the original sample and a simple linear regression model was trained to determine which of the available features was most highly correlated with (log) box office revenue. The log number of IMDB votes emerged as the most highly correlated with box office success, by far.
Table 23: Predictors of box office revenue
| | Variable | Correlation | Statistic | df | p-value | 95% CI |
|---|---|---|---|---|---|---|
| 1 | imdb_num_votes_log | 0.67 | 16.07 | 310 | p < .001 | [ 0.61 , 0.73 ] |
| 3 | imdb_num_votes | 0.46 | 9.07 | 310 | p < .001 | [ 0.37 , 0.54 ] |
| 10 | runtime | 0.28 | 5.21 | 310 | p < .001 | [ 0.18 , 0.38 ] |
| 11 | runtime_log | 0.28 | 5.22 | 310 | p < .001 | [ 0.18 , 0.38 ] |
| 15 | thtr_days | 0.18 | 3.16 | 310 | p < .01 | [ 0.07 , 0.28 ] |
| 16 | critics_score | -0.14 | -2.59 | 310 | p < .05 | [ -0.25 , -0.03 ] |
| 17 | thtr_days_log | 0.14 | 2.52 | 310 | p < .05 | [ 0.03 , 0.25 ] |
Table 23 reveals a correlation of 0.67 between the log number of IMDB votes and (log) box office revenue. Note that movie runtime emerged as a better predictor of (log) box office revenue than critics score, IMDB rating, and audience score. In fact, these scores were negatively correlated with (log) box office revenue. This was not expected.
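The statistic, degrees of freedom, and confidence interval in the first row of Table 23 can be reproduced from the correlation and the sample size. The sketch below uses r = 0.674 and n = 312, the values reported for this correlation in Part 4, and assumes the interval comes from the Fisher z-transformation; the function name is illustrative.

```python
import math


def correlation_summary(r, n, z_crit=1.959964):
    """Return (t statistic, df, (ci_low, ci_high)) for a Pearson correlation."""
    df = n - 2
    # t statistic for the test that the true correlation is zero.
    t_stat = r * math.sqrt(df) / math.sqrt(1.0 - r ** 2)
    # Fisher z-transformation: atanh(r) is roughly normal with SE 1/sqrt(n - 3).
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    ci = (math.tanh(z - z_crit * se), math.tanh(z + z_crit * se))
    return t_stat, df, ci


t_stat, df, (ci_low, ci_high) = correlation_summary(r=0.674, n=312)
print(f"t({df}) = {t_stat:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

The result agrees with the tabled statistic of 16.07 to within rounding and matches the tabled interval of [0.61, 0.73].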
Genre
The multiple regression analysis showed that genre accounts for 23.56% of total variance and NA% of the variance attributed to the terms. Figure 31 shows (log) IMDB votes and (log) box office by genre, with all other values (except the intercept) set to zero.
Figure 31: Genre Analysis
In order of influence, action and adventure films are the most popular, followed by science fiction & fantasy and horror. The effect on the (log) number of votes was linear by genre; however, the revenue effect was nearly exponential. All else equal, action and adventure, science fiction & fantasy, and horror films have a revenue potential 224.9%, 78.1%, and 21.1% greater than the mean, respectively.
Cast
In terms of box office success, casting is likely the single most important decision to be made during project initiation. To characterize actor popularity, a cast votes variable was created to accumulate scores earned over time. A composite total film score, equal to 10 × the IMDB rating plus the audience score, was apportioned among the five credited actors as follows:
* Actor 1: 40%
* Actor 2: 30%
* Actor 3: 15%
* Actor 4: 10%
* Actor 5: 5%
Scores were maintained for each actor and the cast score was the sum of the actor scores for that film. This measure accounted for 12.89% of total variance and NA% of the variance attributed to the terms.
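The apportionment can be sketched as follows. The film records, actor names, and function names are hypothetical. The sketch also assumes that films are processed in chronological order and that a film's cast score includes the shares earned on that film; the text does not say whether the current film is included.

```python
from collections import defaultdict

# Share of the total film score credited to actors 1 through 5, in percent.
SHARES_PCT = [40, 30, 15, 10, 5]


def film_total_score(imdb_rating, audience_score):
    """Composite film score: 10 x IMDB rating plus audience score."""
    return 10 * imdb_rating + audience_score


def apportion(total_score):
    """Split a film's total score among its five credited actors."""
    return [total_score * pct / 100 for pct in SHARES_PCT]


def cast_scores_by_film(films):
    """Accumulate actor scores film by film and return each film's cast score."""
    actor_scores = defaultdict(float)
    result = {}
    for film in films:  # assumed to be in chronological order
        total = film_total_score(film["imdb_rating"], film["audience_score"])
        for actor, share in zip(film["cast"], apportion(total)):
            actor_scores[actor] += share
        # Cast score: sum of the accumulated scores of the film's actors.
        result[film["title"]] = sum(actor_scores[a] for a in film["cast"])
    return result


# Hypothetical films; "Actor A1" appears in both.
films = [
    {"title": "Film A", "imdb_rating": 7.0, "audience_score": 80,
     "cast": ["Actor A1", "Actor A2", "Actor A3", "Actor A4", "Actor A5"]},
    {"title": "Film B", "imdb_rating": 6.0, "audience_score": 60,
     "cast": ["Actor A1", "Actor B2", "Actor B3", "Actor B4", "Actor B5"]},
]
scores = cast_scores_by_film(films)
print(scores)
```

In this example the returning lead actor carries forward the score earned on the first film, so the second film's cast score exceeds its own total film score.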
Figure 32 illustrates the effect of cast scores on the log of IMDB votes, and box office revenue.
Figure 32: Cast Scores Analysis
The (log) number of IMDB votes grows linearly with cast scores; however, box office grows at an increasing rate as cast popularity increases. A one fold increase in cast scores from 500 to 1000 produces a 0 fold increase in box office revenue potential. However, doubling cast scores to 2000 points produces a 0 fold increase in revenue potential. The return on the investment in cast is substantial.